This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com or the YouTube tutorial (in French).
A cheat sheet can be found here.
R and the RStudio IDE can be installed from this web page: https://posit.co/download/rstudio-desktop/
A cheat sheet for RStudio can be found here.
Packages can be installed with the instruction
install.packages() (don't forget the quotation marks " "):
Example for the ggplot2 package:
# Install from CRAN
install.packages("ggplot2")
To see all the packages that are installed, use the code below:
# Installed packages
installed.packages()
Use a package
After installing the desired package, it must be loaded into R with the library command:
library("ggplot2")
library("tibble")
Remark: some libraries need compilation; for this reason, each library should be loaded one by one.
Another way is to put all the desired libraries in a character vector (see below), then load them with the function lapply as shown below:
x <- c("readr","tibble","tidyr","ggplot2","dplyr", "factoextra",
"FactoMineR", "rgl", "colorspace","ggthemes","ggpubr", "readxl")
lapply(x,require,character.only = T)
Two ways:
setwd("url_of_the_directory")
To check that we are in the right directory:
getwd() => returns the absolute filepath of the current directory
To list the files and directories in the current directory:
list.files()
Good resource: https://bookdown.org/ael/rexplor/
Be careful ==> R is case sensitive, so A and a are different.
In R, a variable can be assigned with two symbols: <- OR =. The first one is the most used.
To build a collection of numbers, create a vector with c(), using , as the separator.
Then the function class shows the class of a variable.
x <- 20
l <- c(10, 44, 89)
s <- "I am a character"
bo <- TRUE # a boolean
paste("Variable x:",x)
## [1] "Variable x: 20"
class(x)
## [1] "numeric"
paste("Variable l:",l)
## [1] "Variable l: 10" "Variable l: 44" "Variable l: 89"
class(l)
## [1] "numeric"
paste("Variable s:", s)
## [1] "Variable s: I am a character"
class(s)
## [1] "character"
paste("Variable bo:", bo)
## [1] "Variable bo: TRUE"
class(bo)
## [1] "logical"
Be careful
Some variable names are already used by the software (examples: q, T, F, D, I, var, mean…) ==> Do not use them!
The command paste() is used to concatenate strings and variables. A space is automatically included as separator. To change it, add the argument sep="" with no space or another symbol. It is also possible to use paste0.
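A short sketch of the separator behaviour (the values are invented for the example):

```r
x <- 20
paste("Value:", x)           # "Value: 20"  (default separator: a space)
paste("Value:", x, sep="")   # "Value:20"   (no separator)
paste0("Value:", x)          # "Value:20"   (paste0 is equivalent to sep="")
paste(1:3, c("a","b","c"), sep="-")  # "1-a" "2-b" "3-c" (paste is vectorized)
```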
For any help, use the command help() or the symbol ?. Example:
help(class)
?ggplot2
To remove an object: rm(). At the beginning of a session/project, it can be useful to clean all the data kept by RStudio in its memory:
rm(list = ls())
Using c() does not create a real list but a vector.
For a list, call the function list:
li <- list(10, 44, 89)
li
## [[1]]
## [1] 10
##
## [[2]]
## [1] 44
##
## [[3]]
## [1] 89
class(li)
## [1] "list"
A vector cannot hold NULL elements: NULL values are simply dropped. A list, on the other hand, keeps them:
vector <- c(1, 5, NULL)
liste <- list(1, 5, "u", NULL)
vector
## [1] 1 5
liste
## [[1]]
## [1] 1
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] "u"
##
## [[4]]
## NULL
Also, the list can combine different types of objects, whereas the vector cannot mix numeric and character ==> all the elements of the vector become strings:
vector <- c(1, 5, "u")
liste <- list(1, 5, "u")
vector
## [1] "1" "5" "u"
liste
## [[1]]
## [1] 1
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] "u"
To generate a vector of numbers between 2 limits (limits included):
vector1 <- c(50:100)
vector1 <- seq(from=0, to=5, by=0.5) # increment by 0.5
The vectors can be combined:
v1 <- rep(x = c(1,5), each=3) # repeat each value 3 times
v2 <- rep(x = c(1,5), times=3) # repeat the whole vector 3 times
v3 <- c(v1, v2)
v1
## [1] 1 1 1 5 5 5
v2
## [1] 1 5 1 5 1 5
v3
## [1] 1 1 1 5 5 5 1 5 1 5 1 5
Use the function sample() to generate random values:
lances <- sample(x=c("pile", "false"), size=10, replace=TRUE, prob=c(0.4,0.6))
lances
## [1] "false" "pile" "pile" "false" "false" "pile" "pile" "false" "false"
## [10] "false"
size gives the number of trials
replace=TRUE allows sampling with replacement (by default replace=FALSE)
prob defines the probability of appearance of each value
Extracted vectors
In a vector, each element can be located thanks to names(vector). The labels must be defined after the creation of the vector:
notes <- c(12,15,8,9,11)
names(notes) <- c("Villon", "Polin", "Exfi","Rotaf","Zerif")
Data extraction is done with the [] symbol:
notes[1]
## Villon
## 12
notes["Villon"]
## Villon
## 12
round(x, digits = y) ==> rounds the number x to y digits. If you assign the result back to the variable, the variable keeps the rounded number in memory.
strsplit(x="pocket.365_rt", split="_") ==> splits the string x according to the split argument. Careful with the dot ".": it is a special regex character matching any character, so splitting on "." without fixed=TRUE returns only empty strings. Use split=".", fixed=TRUE (or split="\\.").
nchar ==> number of characters
toupper, tolower ==> upper- and lower-case the string
grep(pattern, x) ==> gives the positions of the elements of x matching the pattern (add value=TRUE to get the elements themselves)
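A quick sketch of these functions on a hypothetical string (note fixed=TRUE to split on a literal dot):

```r
s <- "pocket.365_rt"
nchar(s)                                    # 13
toupper(s)                                  # "POCKET.365_RT"
round(3.14159, digits = 2)                  # 3.14
strsplit(x = s, split = "_")                # "pocket.365" "rt"
strsplit(x = s, split = ".", fixed = TRUE)  # "pocket" "365_rt"
grep("rt", c("pocket.365_rt", "other"), value = TRUE)  # "pocket.365_rt"
```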
A 2-dimensional set of elements divided into rows and columns:
- Matrix ==> accepts a unique type for all columns
- Data frame ==> accepts different types of columns
L3 <- LETTERS[1:3]
fac <- sample(x=L3, size=10, replace=TRUE)
df1 <- data.frame(x = rep(x = 1, times=length(fac)), y=1:10, fac_letter=fac)
df1
The labeling of the rows and the columns can be done afterwards with colnames and rownames.
mat1 <- matrix(c (rexp(n=10, rate=2),
rexp(n=10, rate=2.5),
rexp(n=10, rate=1.8)),
nrow=5, ncol=3)
## Warning in matrix(c(rexp(n = 10, rate = 2), rexp(n = 10, rate = 2.5), rexp(n =
## 10, : data length differs from size of matrix: [30 != 5 x 3]
The warning appears because the 30 generated values do not fit in a 5 x 3 matrix (only the first 15 are kept); use nrow=10 to keep all of them.
colnames(mat1) <- paste("cond", 1:ncol(mat1), sep="_")
rownames(mat1) <- paste("id", 1:nrow(mat1), sep="_")
mat1
## cond_1 cond_2 cond_3
## id_1 0.2493434 0.1687926 0.97938653
## id_2 0.2172397 2.2540553 0.49918140
## id_3 0.7634651 0.5221910 0.69246513
## id_4 0.3398403 0.1969599 0.48204954
## id_5 0.1129144 0.2931617 0.07123217
To import Excel files, the library readxl needs to be installed.
csv files are directly imported as a data.frame. A data set can be displayed with the command View().
smp <- read.csv2("~/code/R/FUN/fichier 1/smp1.csv")
View(smp) # with an upper-case V
To obtain a particular value inside the data.frame, use brackets where rows and columns are separated by a comma:
df[index_row, index_column]
df1[4,2]
## [1] 4
df1[,"fac_letter"]
## [1] "B" "A" "A" "B" "B" "B" "B" "A" "C" "A"
df1$fac_letter # same goal as the previous one, but can cause issues on rare occasions
## [1] "B" "A" "A" "B" "B" "B" "B" "A" "C" "A"
df1[4,]
df1[c(2,5:9), 2]
## [1] 2 5 6 7 8 9
To see the first 6 observations of all the variables, use the command head().
# Utilization of the Iris data set, directly included in R
class(iris)
## [1] "data.frame"
head(iris)
The command which gives the positions of the observations that respect the condition:
== for equality
!= for a difference
<= / >= less/greater than or equal to
< / > strictly less/greater than
& / | AND / OR ==> for multiple conditions
isTRUE(x) tests if x is TRUE
na.rm=TRUE removes the empty values
which(iris$Species != "setosa")
## [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [19] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
## [37] 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
## [55] 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
## [73] 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## [91] 141 142 143 144 145 146 147 148 149 150
To see the corresponding rows:
head(iris[which(iris$Sepal.Length>5),])
To see the unique values of a column, use the function unique():
unique(iris$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
length(unique(iris$Species)) # Give the number of unique values
## [1] 3
The command table groups and counts the observations of a given variable.
The command subset(data.frame, condition, c(vector of variables)) takes the observations that respect the condition and keeps only the requested variables.
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
table(iris$Species != "setosa")
##
## FALSE TRUE
## 50 100
table(subset(iris, Species == "setosa")$Petal.Length > 1.4)
##
## FALSE TRUE
## 24 26
head(subset(iris, Species == "setosa", c(Sepal.Length, Petal.Length)))
Load the library tibble. It allows the creation of a tibble (table), which has the same purpose as a data.frame but with more restrictions. The tibble must have the same size for each column.
library(tibble)
# creation of the dataset
# m1 a list with Null values
m1<-list(1.311,1.287,1.293,1.308,1.291,1.300,1.274,1.287)
m1 <- append(m1, vector("list",5)) # append 5 NULL elements contained in a list
m2<-c(1.298,1.309,1.293,1.251,1.338,1.302,1.270,1.339,1.346,1.292,1.291,1.321,1.285)
Si<-tibble("method1"=m1,"method2"=m2)
Use functional programming
Functional programming is a declarative paradigm in which programs are constructed by applying and composing functions. In R, the tibble library (and the other tidyverse libraries) allow this kind of programming with the construction variable %>% function(). The functional style is a better way to write a comprehensible workflow:
# recover a list containing the values of method 1 padded with NULL values then converted to a vector, without NULL values
m1<-Si$method1 %>% unlist()
m1
## [1] 1.311 1.287 1.293 1.308 1.291 1.300 1.274 1.287
class(m1)
## [1] "numeric"
tidyr & dplyr
These libraries are useful to manipulate a data set.
For example, the first table supposes that each sample has been measured with both methods. In this case, the measurements were actually performed independently of the samples, so the first table does not work. Here it is better that each value receives a tag corresponding to its origin (method1 or method2). The code below proposes this kind of table:
library(dplyr)
library(tidyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Si2<-tibble(method=c(rep("method1",8),rep("method2",13)),values=c(m1,m2))
# Let's simulate a randomized acquisition of the dataset
Si2<-Si2[sample(nrow(Si2)),] # Indexes are generated by random pick, the tibble is shuffled
head(Si2)
# Transposition without some columns
Tab.transpo <- data.frame(t(Tab.Atranspo[,-c(1:3)]))
# Put the name of the line into the 1st column (String)
Tab.transpo <- rownames_to_column(Tab.transpo, "NameColumn")
# reformat as numeric
Tab.aDeriver$variable_for_derivation <- as.numeric(Tab.aDeriver$variable_for_derivation)
## Function of the first derivative
FD2 <- data.frame(apply(Tab.aDeriver[,-1],2,function(x){ # apply works like a loop: applies the function to each column (1 = rows, 2 = columns)
fd <- predict(sm.spline(Tab.aDeriver$variable_for_derivation,
x),Tab.aDeriver$variable_for_derivation,2) # compute the derivative for all values (sm.spline comes from the pspline package)
return(fd) # returns a data frame with the values at each point
}))
Play with the tidyr & dplyr libraries and functional programming
These libraries make the code more “readable” thanks to the %>% symbol.
m1b<-Si2 %>% filter(method == "method1") %>%
select(values) %>%
unlist()
m1
## [1] 1.311 1.287 1.293 1.308 1.291 1.300 1.274 1.287
m1b
## values1 values2 values3 values4 values5 values6 values7 values8
## 1.300 1.311 1.287 1.287 1.291 1.308 1.293 1.274
Cheat sheet for more information on these libraries:
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
A rapid view of the global descriptive statistics with the function summary():
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Note: iris is a dataset which is already included in R
sum(iris$Sepal.Length) # Addition of all the observation in the variable
## [1] 876.5
mean(iris$Sepal.Length) # Average
## [1] 5.843333
sd(iris$Sepal.Length) # Standard deviation
## [1] 0.8280661
q <- quantile(iris$Sepal.Length) # Catch all the quantiles
q
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
q[1]
## 0%
## 4.3
length(iris$Sepal.Length) # Number of observation in the variable
## [1] 150
The confidence interval at 95%: \(IC=\bar{x}\pm1.96 \cdot se\), where 1.96 is obtained with qnorm(y) & qnorm(1-y):
y <- (1-0.95)/2
y
## [1] 0.025
signif(qnorm(y),3)
## [1] -1.96
signif(qnorm(1-y),3)
## [1] 1.96
The standard error is the standard deviation of the mean of the distribution, whereas the standard deviation describes the individual distribution of a normal law.
x <- iris$Sepal.Length
x_N<-length(x)
x_sd<-sd(x)
x_mu<-mean(x)
x_se<-x_sd/sqrt(x_N)
x_ICmin<-x_mu+x_se*qnorm(0.025)
x_ICmax<-x_mu+x_se*qnorm(0.975)
paste(signif(x_ICmin,3),"<",signif(x_mu,3),"<",signif(x_ICmax,3))
## [1] "5.71 < 5.84 < 5.98"
The command signif(variable, number) prints the variable with the requested number of significant digits.
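A minimal sketch of the difference between signif (significant digits) and round (decimal places), with invented values:

```r
v <- 0.0123456
signif(v, 3)         # 0.0123 : 3 significant digits
round(v, digits = 3) # 0.012  : 3 decimal places
signif(123.456, 2)   # 120
round(123.456, 2)    # 123.46
```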
With R, it is easy to evaluate a probability law:
with x & q a vector of values to be evaluated by mylaw (it can be a single value, several values, or an interval to draw a graph for example), p the probability of mylaw, and n the number of observations to generate.
Below, the 3 main laws:
The Normal law is used to describe continuous variables (time, temperature…) for which it is impossible to predict the value k of a variable X with exactitude. The reason is that the probability of giving precisely the right value is close to 0, because of the infinity of possibilities. For example, it is impossible to predict that the temperature in room 401N will be exactly 20°C, because there is an infinity of possibilities between 19.9°C and 20.1°C. However, the behaviour of the variable X can be described thanks to the probability density function (PDF). With this description, it is possible to deduce the probability over a given interval, represented by the area under the PDF curve on this interval. The probability that X lies in the interval [a,b], \(\mathbb{P}(X\in[a,b])\), is given by the cumulative distribution function (CDF), which is the primitive of the PDF:
\[F(x) = \mathbb{P}(X \le x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^{2}}{2\sigma^{2}}}\, dt\]
Where \(\mu\) is the mean and \(\sigma\) is the standard deviation which describe the normal law.
my_mu<-20 # mean value of the normal law
my_sg<-sqrt(20) # std dev of the normal law
dnorm(20,mean=my_mu,sd=my_sg) # point evaluation of the density at x=20 of a normal law centered at my_mu and width sigma=my_sg
## [1] 0.08920621
pnorm(20,mean=my_mu,sd=my_sg) # point evaluation of the CDF at q=20 of this same normal distribution
## [1] 0.5
qnorm(0.05,mean=my_mu,sd=my_sg) # quantile of this normal distribution considering a probability p=0.05
## [1] 12.64399
rnorm(10,mean=my_mu,sd=my_sg) # generate 10 random numbers following this normal distribution
## [1] 25.49505 14.72169 16.11864 22.73180 18.75411 24.97030 12.22927 22.98842
## [9] 18.63697 16.87710
The Cumulative Distribution Function has remarkable properties:
(to be completed one day)
The Binomial law can be positive or negative, but in the general case only the positive one is used. This is the law followed by a random selection when there are ONLY TWO possible results: A or B, alive or dead. The probability of obtaining each event must be constant (but not necessarily equal). The probability of obtaining the event A will be \(p\) and B will be \(q=(1-p)\). So, with \(n\) selections, the probability of obtaining k events of A, \(\mathbb{P}(X=k)\), follows the binomial law \(B(n,p)\). The mean \(\mu\) corresponds to the expectation \(\mathbb{E}\) (espérance in French) and is given by \(\mu=np\); the variance is given by \(var=npq\).
n<-30 # population
p<-0.1 # probability of success
dbinom(x=1,size=n,prob=p) # Probability that I win exactly 1 time in 30 plays
## [1] 0.1413039
pbinom(q=2,size=n,prob=p) # CDF, probability to win 0,1 or 2 times in 30 plays
## [1] 0.4113512
qbinom(p=0.5,size=n,prob=p) # Quantile of 0.5 from the binomial law
## [1] 3
rbinom(n=10,size=n,prob=p) # Sample of size 10 from the binomial law
## [1] 0 6 6 4 4 8 3 3 1 5
The formula of the binomial law is:
\[\mathbb{P}(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
Where \(\binom{n}{k}\) is the binomial coefficient (n choose k), which can be calculated with the command choose(n,k) or by \(\binom{n}{k}=C_n^{k}= \frac{n!}{k!(n-k)!}\)
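The formula can be checked directly in R; the values below reuse n=30 and p=0.1 from the binomial example above:

```r
n <- 30; k <- 1; p <- 0.1
choose(5, 2)                                  # 10
factorial(5) / (factorial(2) * factorial(3))  # the same result, 10
# the hand-written formula matches dbinom
choose(n, k) * p^k * (1 - p)^(n - k)          # same value as dbinom(x=1, size=30, prob=0.1)
```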
The Poisson law is used to describe the probability of a rare event in an interval of time. It is an approximation of the binomial law when \(p\) tends to zero (very low probability) and \(n\) tends to \(\infty\) (a lot of repetitions). This law takes the parameter \(\lambda = np\), which represents both the expectation \(\mathbb{E}\) and the variance (which is logical since \(p\rightarrow0\Rightarrow q\rightarrow1\) in the binomial law). That is why the Poisson law is written \(P(\lambda)\) and represents the mean probability that the rare event occurs \(k\) times.
n<-30 # number of observation
m <- 20 # Poisson law parameter
dpois(x=1, lambda =m ) # probability of exactly 1 event
## [1] 4.122307e-08
ppois(q=2, lambda =m ) # CDF, probability of 0, 1 or 2 events
## [1] 4.55515e-07
qpois(p=0.5, lambda =m ) # quantile of 0.5 from the Poisson law
## [1] 20
rpois(n=n, lambda =m ) # sample of size n from the Poisson law
## [1] 25 19 17 26 21 21 17 19 22 23 23 15 19 16 17 20 12 14 22 21 19 25 22 25 21
## [26] 23 17 18 17 24
The formula of the Poisson law is: \[\mathbb{P}(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}\]
Note: factorial calculation in R:
factorial(5)
## [1] 120
It is possible to measure the shape of the distribution thanks to the library moments:
library("moments")
The Skewness moment gives some information about the symmetry of the distribution:
virginica <- subset(iris, Species == "virginica")
skewness(virginica$Sepal.Length, na.rm = TRUE)
## [1] 0.1144447
This can be seen in the graph with the mean in red:
The Kurtosis moment measures how heavy or flat the tails are.
virginica <- subset(iris, Species == "virginica")
kurtosis(virginica$Sepal.Length, na.rm = TRUE)
## [1] 2.912058
Some graphics can be used in R without any package, but they are limited and do not offer many possibilities to customize them. Those graphs can be useful for a first and fast exploration of the data, but for a presentation it is more interesting to create more complex/beautiful graphs. That is why it is better to work with the package ggplot2.
First of all, install the package ggplot2, then load it into R with the library command:
library(ggplot2)
A ggplot2 graphic is defined at least by a data set, some aes (the “aesthetics” of the graphic), and the type of the graphic given by a geom_ function. The aes is defined at least by x and y values. It is also possible to add the color, the fill, the group. Some examples are shown below.
WARNING:
All the functions are added together with the + symbol.
For more information: https://bookdown.org/ael/rexplor/chap8.html (in French)
A graphic can be caught in a variable, but it is recommended to have one unique block per graphic:
# Determination of the limits
alpha=0.0001 # kind of "type I risk", the part of the plot I want to hide on the left/right sides.
xmin=qnorm(alpha,mean=my_mu,sd=my_sg) # left value of the range to plot the distribution
xmax=qnorm(1-alpha,mean=my_mu,sd=my_sg) # right value of the range to plot the distribution
# A beautiful plot of this normal law CDF
ggplot()+ # start plot
xlim(xmin,xmax)+ # define the range to plot
stat_function(fun=pnorm,args=c(mean=my_mu,sd=my_sg))+ #Define the CDF of a normal law
stat_function(fun=pnorm,args=c(mean=my_mu,sd=my_sg),geom="area",xlim=c(xmin,20))+ # same normal law CDF, plotted as an area under the curve up to the value 20
xlab("x")+ # decoration of the x axis
ylab("P(X<=x)")+ # decoration of the y axis
labs(title=paste("Normal law CDF - mean=",my_mu,", std dev=",signif(my_sg,3)))+ # title of the plot
theme_bw(base_size=12)+ # ...the final touch
theme(plot.title = element_text(hjust = 0.5, color = "blue"))
Here stat_function(fun=pnorm, args=c(mean=my_mu, sd=my_sg)) is used to draw a function (the CDF of a normal law in this case). The parameter fun= gives the function to draw, and args= takes the parameters of this function as arguments. Then a second curve is added with the parameter geom to display the area under the curve.
plot(Ca,
type = "p", # plot type: points ("p"), lines ("l"), both ("b" or "o")
col = "blue", # colour, type `colours()` for the full list
pch = 4, # symbol type, a number between 0 and 25, type `?points`
cex = 0.5, # symbol size
lty = 3, # line type, a number between 1 and 6
lwd = 1.2, # line width
ylab = "Ca content", # title for the y axis
main="Exercise 3 of the Statistics lecture")
abline(h=mean(Ca)+3*sd(Ca), col="red") # Add a line on the graphic
text(15,200, "3 * Standard Deviation", col="red") # Add a text on the graphic
The abline & text functions add a line and a text respectively to the graphic.
With ggplot2 use geom_point():
# Example of data frame for a binomial law
df_binom<-data.frame(x=c(0:n),p=dbinom(x=c(0:n),size=n,prob=p))
# Generation of a graph
ggplot(data=df_binom)+
aes(x=x,y=p)+
geom_point()+
geom_segment(aes(x=x,y=0,xend=x,yend=p))+
theme_light()+
xlab("Number of success, x")+
ylab("Probability, p")+
labs(title=paste("Density of the binomial law, size=",n,", prob=",p))
Basic R code to generate a barplot of a Poisson law:
# Poisson distribution parameter
lambda <- 3 # Lambda parameter
# Generate values from the Poisson distribution
x <- c(0:10) # Possible values (adjust as needed)
probas <- dpois(x, lambda = lambda) # PDF
# Create the barplot
barplot(probas, names.arg = x, col = "skyblue", main = "Poisson Distribution",
xlab = "Number of events", ylab = "Probability", ylim = c(0, max(probas) + 0.1))
With ggplot2 use geom_bar():
ggplot(iris, aes(x = Species, y = Sepal.Length))+
geom_bar(stat = "identity")+
theme_test()
It is possible to have multiple groups in the barplot. It is just needed to add the parameter fill = group2 in the aes(). By default the groups are stacked. To have the bars side by side, add position=position_dodge() in the geom_bar parameters.
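A minimal sketch of this dodged layout, assuming a small hypothetical data frame (the column names cond, group2 and n are invented for the example):

```r
library(ggplot2)

# hypothetical counts for two groups in three conditions
df <- data.frame(cond   = rep(c("A", "B", "C"), each = 2),
                 group2 = rep(c("g1", "g2"), times = 3),
                 n      = c(3, 5, 2, 6, 4, 1))

p <- ggplot(df, aes(x = cond, y = n, fill = group2)) +
  geom_bar(stat = "identity", position = position_dodge()) + # side-by-side bars instead of stacked
  theme_test()
p
```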
More complicated is the case of 2 continuous y variables. It is not possible at this step with the geom_bar function, which allows only one x for one y. To solve this issue, the data need to be reorganized to transform a table with several y columns into a table with only two columns. This transformation can be done with the function gather(data, key, value, columns_to_gather) of the library tidyr.
# Reformatting of the data thanks to the library tidyr
library(tidyr)
reformat <- gather(iris, key = "Part_flower", value = "Length", Sepal.Length, Petal.Length)
head(reformat)
# Work on the mean and Standard deviation
reformat.summary <- reformat %>%
group_by(Part_flower, Species) %>%
summarise( sd = sd(Length), len = mean(Length))
## `summarise()` has grouped output by 'Part_flower'. You can override using the
## `.groups` argument.
reformat.summary
# Plot
ggplot(reformat.summary, aes(x = Species, y = len, fill=Part_flower))+
geom_bar(stat = "identity", position=position_dodge(0.8), width = 0.7)+
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd, group = Part_flower),
width = 0.2, position = position_dodge(0.8)
)+
labs(title="Average of the Petal and Sepal Length as a function of the Species")+
theme_test()
More inspiring barplot : https://www.datanovia.com/en/fr/lessons/ggplot-barplot/
Basic R code to generate a histogram of a normal law:
hist(rnorm(100,mean=my_mu,sd=my_sg), freq=F)
curve(dnorm(x,mean=my_mu,sd=my_sg),from = xmin,to = xmax,ylab="density",add=T,col="red")
With ggplot2 use geom_histogram():
#data <- data.frame(x = rnorm(100,mean=my_mu,sd=my_sg))
ggplot(iris, aes(x= Sepal.Length))+
geom_histogram(aes (y=after_stat(density),color=Species, fill = Species), #after_stat(density) normalize to 1
bins = 50,
position = "identity",
alpha = 0.4,
)+ # bins = number of intervals in the histogram
geom_density(aes(color = Species), linewidth = 1)+ # show the density of each distribution
stat_function(fun=dnorm,args=c(mean=mean(iris$Sepal.Length),sd=sd(iris$Sepal.Length)), # show the theoretical normal law for the global distribution
color ="red",
linewidth = 0.1)+
scale_color_manual(values = c("#00AFBB", "#E7B800","pink"))+ #use personalized color
scale_fill_manual(values = c("#00AFBB", "#E7B800","pink"))+ #use personalized color
theme_bw()
The boxplot represents the distribution of the observations as a function of the group. It also gives the quartiles, the mean and the median.
The easiest way to have a boxplot is to use the basic R command boxplot(quantitative_variable ~ qualitative_variable, labels). Between the quantitative variable and the qualitative variable there is a tilde (~).
Example:
boxplot(iris$Sepal.Length~iris$Species,ylab="Sepal Length",xlab="Species")
With ggplot2 use geom_boxplot():
p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species, color = Species))+
ylim(4,8)+
ylab("Sepal Length")+
geom_boxplot(alpha=0.5, # alpha for the transparency
outlier.colour="red",
outlier.shape=8,outlier.size=4 # note, outliers are shown twice (also as jittered points)
#outlier.alpha=0 to not represent the outliers twice
) +
stat_summary(fun = mean, geom="point", shape=18, size=5, color="white") +
geom_point(position="jitter", alpha=.5)+
theme_classic()
p
The function stat_summary(fun = mean) displays the mean value.
As shown, it is possible to combine several graphics in one by adding geom functions. But sometimes the graphics need to be separated: this is called faceting. Two functions can be used:
facet_wrap(qualitative_variable, nrow = number of lines)
facet_grid(line ~ column)
ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, shape = Species, color=Species))+
theme_bw()+
geom_smooth(method="lm", formula= y~poly(x,3), se=TRUE, size = 0.5) +
geom_point() +
facet_wrap(~Species, nrow = 2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, shape = Species, color = Species))+
theme_bw()+
geom_point() +
facet_grid(~Species)
In the first faceting graph, the function geom_smooth(method="lm", formula= y~poly(x,3), se=TRUE, size = 0.5) has been introduced. This function adds a smooth regression curve to show the general trend of the dataset. Here the method used to draw this curve is a linear model fit, lm.
For more inspiring graphs:
https://rstudio-pubs-static.s3.amazonaws.com/578122_5e69256788bb4dcca6157d2bcfa7694e.html
https://www.charlesbordet.com/fr/faire-beaux-graphiques-ggplot2/#it%C3%A9ration-4---ajouter-des-couleurs-pour-chaque-groupe
First we need to check whether the distribution of the variable follows a normal law (the H0 hypothesis).
A rapid and graphical test is to check the Q-Q plot (droite d'Henry):
qqnorm(iris$Sepal.Length, main = "Normal Q-Q Plot of the Global Sepal Length distribution")
qqline(iris$Sepal.Length, col="red")
Also with ggplot:
ggplot(iris, aes(sample = Sepal.Length, colour = Species)) +
stat_qq() +
stat_qq_line()+
theme_light()+
xlab("Theoretical Quantiles")+
ylab("Sample Quantiles")+
labs(title="Normal Q-Q Plot of the Sepal Length distribution for each Species")
The visual data analysis is not sufficient to prove the normality of the distribution. It is mandatory to complete it with a statistical test in order to check the H0 hypothesis. If the test is significant (p-value < 0.05), the distribution is not normal. So a non-significant test (p-value > 0.05) supports the normality of the distribution.
The normality test is sensitive to the sample size: small samples usually pass the test. That is why both the visual inspection AND the normality test must be performed.
Remark:
In reality, if the H0 hypothesis is NOT rejected, we cannot affirm that the sample follows a normal law. Indeed, it is impossible to validate a theory at 100% because there are too many possibilities. However, if the test is rejected under the given conditions, this fact is certain.
In R, this test is performed with the command shapiro_test() included in the rstatix package, which can be installed from CRAN with install.packages("rstatix").
For One variable
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
shapiro_test(iris, Sepal.Length)
Here p-value > 0.05 ==> the test is not significant ==> H0 is not rejected. The global distribution of the Sepal length is compatible with a normal law.
Variables classified by group
iris %>%
group_by(Species) %>%
shapiro_test(Sepal.Length)
Note: here the %>% pipe (from the tidyverse) is used in order to manipulate the data.
For multiple variables
shapiro_test(iris, Sepal.Length, Petal.Width)
Here the p-value of the Petal width is < 0.05 ==> significant test ==> H0 rejected. The global distribution of the Petal width does not follow a normal law. This can be checked with a simple Q-Q plot:
The normality test above is a particular case of this test.
In general, this test checks the hypothesis that a sample follows a given CDF (normal, binomial…) or that two samples follow the same law.
1) Creation of samples
a<-rnorm(100,mean=0,sd=1) # a normal law with 100 values
b<-rgamma(100,shape=1,rate=0.8) # another law
c<-rnorm(50,mean=0,sd=1) # the same normal law as a
d<-rnorm(50, mean=2, sd=1) # a second normal law, but not with the same parameters
2) Do a & b follow the same law ?
ks.test(a,b)
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: a and b
## D = 0.47, p-value = 5.099e-10
## alternative hypothesis: two-sided
The p-value is very small (<0.05) ==> significant test ==> H0 rejected ==> the samples do not follow the same law
3) Does a follow a normal law ?
ks.test(a,"pnorm") # the standard normal law (default mean 0 & sd 1)
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: a
## D = 0.087583, p-value = 0.427
## alternative hypothesis: two-sided
ks.test(a,"pnorm",0,1) # a particular normal law defined by mean 0 & sd 1
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: a
## D = 0.087583, p-value = 0.427
## alternative hypothesis: two-sided
ks.test(a,"pnorm",2,1) # a particular normal law defined by mean 2 & sd 1
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: a
## D = 0.69373, p-value < 2.2e-16
## alternative hypothesis: two-sided
For the first two, p-value > 0.05: the test is not significant ==> H0 is not rejected. For the last one, p-value < 0.05 ==> a does not follow this particular law.
4) Do a, c & d follow the same normal law ?
ks.test(a,c)
##
## Exact two-sample Kolmogorov-Smirnov test
##
## data: a and c
## D = 0.16, p-value = 0.3503
## alternative hypothesis: two-sided
ks.test(a,d)
##
## Exact two-sample Kolmogorov-Smirnov test
##
## data: a and d
## D = 0.79, p-value < 2.2e-16
## alternative hypothesis: two-sided
a & c follow the same normal law; d does not.
This test is performed to compare the equality of two variances. The H0 hypothesis is \(H_0:\sigma^{2}_{1}=\sigma^{2}_{2}\).
The Fisher F statistic is the ratio of the two sample variances, each estimated with its degrees of freedom (\(DF_i = n_i - 1\)): \[ F = \frac{s_1^2}{s_2^2} \]
setosa <- subset(iris, Species == "setosa")
virginica <- subset(iris, Species == "virginica")
versicolor <- subset(iris, Species == "versicolor")
var.test(setosa$Sepal.Width, virginica$Sepal.Width,conf.level = 0.95) # with a bilateral risk of 5%
##
## F test to compare two variances
##
## data: setosa$Sepal.Width and virginica$Sepal.Width
## F = 1.3816, num df = 49, denom df = 49, p-value = 0.2614
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7840128 2.4346017
## sample estimates:
## ratio of variances
## 1.381578
Here the p-value is greater than 0.05: H0 is not rejected.
var.test(virginica$Sepal.Length, virginica$Sepal.Width,conf.level = 0.95) # with a bilateral risk of 5%
##
## F test to compare two variances
##
## data: virginica$Sepal.Length and virginica$Sepal.Width
## F = 3.8878, num df = 49, denom df = 49, p-value = 5.014e-06
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.206211 6.850965
## sample estimates:
## ratio of variances
## 3.88776
Here the p-value is far below 0.05. The H0 hypothesis is rejected, which means that the variances of these two variables are different.
WARNING:
This test should not be confused with Fisher's exact test (test of independence), which is performed to test the dependence between several groups.
This test is performed to compare the equality of two means. The H0
hypothesis is that \(H_0:\mu_{1}=\mu_{2}\).
In other words: do the samples come from the same population?
This test needs 3 conditions:
- the observations are independent
- each sample follows a normal law
- the two variances are equal (checked with the F-test above), which corresponds to the argument var.equal = TRUE
Then we can perform the t test.
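For intuition, the pooled t statistic can be computed by hand and compared with the result of t.test (a sketch with simulated data, not from the document):

```r
set.seed(42)
g1 <- rnorm(20, mean = 5)
g2 <- rnorm(20, mean = 5.5)
n1 <- length(g1); n2 <- length(g2)
# pooled variance of the two samples
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
# t = difference of means over its standard error
t_manual <- (mean(g1) - mean(g2)) / sqrt(sp2 * (1/n1 + 1/n2))
res <- t.test(g1, g2, var.equal = TRUE)
all.equal(unname(res$statistic), t_manual)
```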
t.test(setosa$Sepal.Width, virginica$Sepal.Width,conf.level = 0.95, var.equal = TRUE)
##
## Two Sample t-test
##
## data: setosa$Sepal.Width and virginica$Sepal.Width
## t = 6.4503, df = 98, p-value = 4.246e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3143257 0.5936743
## sample estimates:
## mean of x mean of y
## 3.428 2.974
Here the p-value is far below 0.05. The H0 hypothesis is rejected, which means that the means of these two categories are different.
If the variances of the two samples are not equal ==> use the Welch
approximation with
var.equal = FALSE
t.test(virginica$Sepal.Length, virginica$Sepal.Width, conf.level = 0.95, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: virginica$Sepal.Length and virginica$Sepal.Width
## t = 35.842, df = 72.643, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.413027 3.814973
## sample estimates:
## mean of x mean of y
## 6.588 2.974
When the distribution does not follow a normal law, or if n < 8, it is possible to use the Wilcoxon test (a non-parametric alternative):
wilcox.test(iris$Petal.Length, iris$Petal.Width, conf.level = 0.95)
##
## Wilcoxon rank sum test with continuity correction
##
## data: iris$Petal.Length and iris$Petal.Width
## W = 19349, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Note that the Wilcoxon test does not compare the means directly but works on the ranks of the values
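The W statistic is in fact the sum of the ranks of the first sample, shifted by its minimum possible value (a sketch with simulated, tie-free data):

```r
set.seed(3)
x <- rnorm(10)
y <- rnorm(10, mean = 1)
W <- unname(wilcox.test(x, y, exact = TRUE)$statistic)
# rank the pooled values and sum the ranks belonging to x
r <- rank(c(x, y))
W_manual <- sum(r[1:10]) - 10 * (10 + 1) / 2
all.equal(W, W_manual)
```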
Outliers can be detected with a boxplot. However, the detection can also
be performed with a statistical test. In R, the detection of outliers is done
with the library outliers.
The Dixon test can be used for small samples. It evaluates a suspect point by comparing the distance between that point and its nearest neighbour with the range of the measurements. The H0 hypothesis is: the value is not an outlier.
##
## Dixon test for outliers
##
## data: data_for_dixon
## Q = 0.96277, p-value < 2.2e-16
## alternative hypothesis: highest value 200 is an outlier
The Dixon test handles one value at a time: if there are 2 or more outliers, the test must be repeated until the p-value is higher than 0.05.
# library needed for dixon.test / grubbs.test
library("outliers")
# dataset
data_for_dixon <- c(-1000, 10, 12, 13, 14, 15, 16, 17, 18, 19, 200.5)
# initialization of pvalue to enter the loop
pvalue <- 0
# loop while the p-value is < 0.05
while(pvalue < 0.05){
# test
result_dixon <- dixon.test(data_for_dixon)
pvalue <- result_dixon$p.value # catch the p-value in a variable
# if pvalue is under 0.05 ==> remove the outlier
if (pvalue < 0.05){
# catch the outlier in a variable:
# $alternative is a character string
# split it into a list and select the 3rd word
# convert it into a numeric value
val_outlier <- as.numeric(strsplit(result_dixon$alternative, " ")[[1]][3])
# remove the outlier
data_for_dixon <- data_for_dixon[-which(data_for_dixon==val_outlier)]
}
# printing
print(result_dixon)
print(data_for_dixon)
}
##
## Dixon test for outliers
##
## data: data_for_dixon
## Q = 0.99313, p-value < 2.2e-16
## alternative hypothesis: lowest value -1000 is an outlier
##
## [1] 10.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0 200.5
##
## Dixon test for outliers
##
## data: data_for_dixon
## Q = 0.96286, p-value < 2.2e-16
## alternative hypothesis: highest value 200.5 is an outlier
##
## [1] 10 12 13 14 15 16 17 18 19
##
## Dixon test for outliers
##
## data: data_for_dixon
## Q = 0.25, p-value = 0.7435
## alternative hypothesis: lowest value 10 is an outlier
##
## [1] 10 12 13 14 15 16 17 18 19
When the dataset has more than 30 values, the Dixon test does not work:
dixon.test(versicolor$Sepal.Length)
## An error occurred: Sample size must be in range 3-30
In this case, the Grubbs test is more appropriate. This test also handles one value at a time, so to remove all the outliers we need to iterate, as for the Dixon test above.
grubbs.test(versicolor$Sepal.Length)
##
## Grubbs test for one outlier
##
## data: versicolor$Sepal.Length
## G = 2.06133, U = 0.91151, p-value = 0.8977
## alternative hypothesis: highest value 7 is an outlier
grubbs.test(virginica$Sepal.Length)
##
## Grubbs test for one outlier
##
## data: virginica$Sepal.Length
## G = 2.65459, U = 0.85325, p-value = 0.1509
## alternative hypothesis: lowest value 4.9 is an outlier
boxplot(virginica$Sepal.Length)
Remark: Here the Grubbs test does not find an outlier, whereas the
boxplot does. The outliers flagged by a boxplot are not based on a true
statistical test. That is why, in case of doubt, it is mandatory to
perform a test.
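For reference, the G statistic of the Grubbs test can be sketched in base R (grubbs_G and the data vector are illustrative, not from the document; outliers::grubbs.test handles this for you):

```r
# G = largest absolute deviation from the mean, in sd units
grubbs_G <- function(x) max(abs(x - mean(x))) / sd(x)
x <- c(10, 12, 11, 13, 12, 11, 10, 50) # 50 is a planted outlier
G <- grubbs_G(x)
# two-sided critical value at the 5% level
n <- length(x)
tcrit <- qt(1 - 0.05 / (2 * n), n - 2)
G_crit <- ((n - 1) / sqrt(n)) * sqrt(tcrit^2 / (n - 2 + tcrit^2))
G > G_crit # TRUE ==> 50 is flagged as an outlier
```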
The analysis of variance. The purpose of this test is to find which factor has an impact on the variance. This analysis is based on the Fisher test and is close to the F-test of variance above, except that the comparison is not performed between two individual groups: it compares the variance between the groups to the variance within the groups:
\[ F =
\frac{\frac{\sigma_{inter}^2}{DF_{inter}}}{\frac{\sigma_{intra}^2}{DF_{intra}}}
\]
With \(\sigma_{inter}^2\) the
between-group (inter) sum of squares: \(\sigma_{inter}^2=\sum(mean_{group}-mean_{total})^2\)
And \(\sigma_{intra}^2\) the within-group
(intra) sum of squares: \(\sigma_{intra}^2=\sum(x_{i}-\overline{x_{group}})^2\)
ANOVA assumes that the observations are independent, follow a normal law and contain no significant outlier. The H0 hypothesis is that all the group means are equal.
ANOVA, in R, is performed thanks to the command
aov(observation~factor, dataset). The results can be shown
thanks to the command summary.aov.
species_anova <- aov(Sepal.Length~Species, iris)
summary.aov(species_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the p-value is under 0.05. That means that H0 is rejected: the factor Species has a significant impact on the sepal length.
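The F value above can be reproduced by hand from the between- and within-group sums of squares (a sketch in base R; note that each squared deviation of a group mean is weighted by the group size):

```r
fit <- aov(Sepal.Length ~ Species, iris)
grand_mean <- mean(iris$Sepal.Length)
# between-group sum of squares (weighted by the group sizes)
ss_between <- sum(tapply(iris$Sepal.Length, iris$Species,
                         function(g) length(g) * (mean(g) - grand_mean)^2))
# within-group sum of squares
ss_within <- sum(tapply(iris$Sepal.Length, iris$Species,
                        function(g) sum((g - mean(g))^2)))
# F = mean square between / mean square within (df: 3-1=2 and 150-3=147)
F_manual <- (ss_between / 2) / (ss_within / 147)
all.equal(summary(fit)[[1]]$`F value`[1], F_manual)
```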
For this example, the dataset ToothGrowth is used, and
it is assumed that all the conditions for the ANOVA are met. This dataset
examines the tooth growth length in guinea pigs based on the dosage and
supplementation of vitamin C.
The two-way ANOVA, in R, is performed with the same command as the
one-way one. The interaction between the two factors is added with the symbol
* between the 2 factors:
aov(observation~factor1*factor2, dataset).
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
ToothGrowth$dose <- as.factor(ToothGrowth$dose) # dose is numeric ==> Conversion as a factor
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
# 2-way ANOVA
tooth_anova <- aov(len~supp*dose,ToothGrowth)
summary.aov(tooth_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.4 205.4 15.572 0.000231 ***
## dose 2 2426.4 1213.2 92.000 < 2e-16 ***
## supp:dose 2 108.3 54.2 4.107 0.021860 *
## Residuals 54 712.1 13.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the p-value is under 0.05 for both the supplementation and the dosage of vitamin C. That means that H0 is rejected: these two factors have a significant impact. At the same time, the impact of the interaction between these two factors is also significant (p-value < 0.05).
The two-way ANOVA can also be performed without the cross interaction. In
this case the symbol * is replaced by +.
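Once a factor is found significant, a common follow-up (not covered in this document) is Tukey's HSD post-hoc test, which compares the factor levels pairwise with adjusted p-values:

```r
# fit the same two-way model with interaction
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
tooth_anova <- aov(len ~ supp * dose, ToothGrowth)
# pairwise comparisons of the dose levels
res_tukey <- TukeyHSD(tooth_anova, which = "dose")
res_tukey
```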
# Step 0 libraries (ggplot2 for the plots, dplyr for %>% / group_by)
library("ggplot2")
library("dplyr")
# Step 1 creation of the theme
theme_set(
theme_classic() +
theme(legend.position = "top")
)
my3cols <- c("#2E9FDF", "#FC4E07", "#E7B800")
# Prepare the data
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
df.summary <- ToothGrowth %>%
group_by(dose,supp) %>%
summarise(sd = sd(len, na.rm = TRUE), len = mean(len))
# Step 2 Creation of the different graphs
# 1- The box plot
bxp <- ggplot(ToothGrowth, aes(x = dose, y = len))+ geom_boxplot(aes(color = supp)) +
scale_color_manual(values = my3cols)
# 2- The line plot
lin <- ggplot(df.summary, aes(dose, len)) +
geom_line(aes(linetype = supp, group = supp, color=supp))+
geom_point()+
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd, group = supp, color=supp),
width = 0.2
)+
scale_color_manual(values = my3cols)
# Step 3 pool the graphs together
library("ggpubr")
## Warning: package 'ggpubr' was built under R version 4.3.2
figure <- ggarrange(bxp, lin,
labels = c("A", "B"),
ncol = 2, nrow = 1,
common.legend = TRUE,
legend = "right")
figure
The PCA (Principal Component Analysis) is an unsupervised analysis used to reduce the number of variables. Indeed, with more than 3 variables it becomes complicated to represent the information graphically; the goal of the PCA is therefore to reduce the number of dimensions in order to get a better visualization.
Two packages (developed by French teams) are recommended:
- FactoMineR -> for the statistical methods
- factoextra -> for the visualization
library("FactoMineR")
## Warning: package 'FactoMineR' was built under R version 4.3.2
library("factoextra")
## Warning: package 'factoextra' was built under R version 4.3.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("ggpubr")
To work on the PCA, the dataset decathlon is used
(created for this purpose).
decathlon<-read.table("https://r-stat-sc-donnees.github.io/decathlon.csv",
header=TRUE,sep=";",dec=".",
row.names=1,check.names=FALSE,fileEncoding="latin1")
head(decathlon)
# Make the PCA
pca.decat <- PCA(decathlon[,-c(11:13)], graph=FALSE) # keep only the variables of interest
Then make the graph with fviz_pca_ind
graph1 <- fviz_pca_ind(pca.decat, axes = c(1,2), #plot with the dimension selected
pointshape = 20,
pointsize = 4,
geom.ind = "point",
mean.point = F, #hide the point of the mean
col.ind = decathlon$Competition,
addEllipses = FALSE)
ggpar(graph1, title = "PCA of the Decathlon Dataset",
subtitle = expression(paste("Plot by Individual", sep = "")), # expression(paste) => allows subscripts [] or superscripts ^ in labels
caption = "",
legend = "right",
legend.title = "Competition")
It could be interesting to look at the scree plot:
# Scree plot
fviz_eig(pca.decat, addlabels =TRUE)
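A base-R equivalent of this scree plot can be sketched with prcomp (using the iris dataset rather than decathlon, to avoid the download; the result is the percentage of variance carried by each dimension):

```r
# PCA on the 4 numeric columns of iris, standardized
pca <- prcomp(iris[, 1:4], scale. = TRUE)
# percentage of variance explained by each principal component
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
barplot(var_explained, names.arg = paste0("Dim", 1:4),
        ylab = "% of explained variance")
round(var_explained, 1)
```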
3D plot
library("rgl")
# Convert the group as a vector
decathlon$Competition <- factor(decathlon$Competition)
# 3 D dans R
plot3d(pca.decat$ind$coord[,1],
pca.decat$ind$coord[,2],
pca.decat$ind$coord[,3],
xlab="Dim1",ylab="Dim2",zlab="Dim3",
col = rainbow(length(levels(decathlon$Competition))),
type="s",radius=0.1)
# Add legend
legend3d("topright", legend = levels(decathlon$Competition),
col = rainbow(length(levels(decathlon$Competition))),
pch = 16, cex = 1.2)
Remark: a new window will appear.
With the code below and the GIMP software - or an online GIF creator
website like here
- it is possible to create an animation of the 3D picture. The 3D
plot above needs to be created first.
setwd("C:/Users/hpier/Documents/code/R/ChemoinfoR_tuto/3D")
for (i in 1:360) {
rgl.viewpoint(i, 20)
filename <- paste("pic", formatC(i, digits = 1, flag = "0"), ".png", sep = "")
rgl.snapshot(filename) }
getwd()
Remark: rgl.viewpoint is deprecated; view3d can be used instead.
A linear regression \(y=f(x)\) can
be implemented thanks to the following function:
lm(y~x, data=my_dataset). It is based on the least square
method.
To force the curve to go through the origin, just add the term
-1 after the x variable in the formula.
my_model <- lm(data=iris, Sepal.Length~Sepal.Width)
summary(my_model)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5561 -0.6333 -0.1120 0.5579 2.2226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5262 0.4789 13.63 <2e-16 ***
## Sepal.Width -0.2234 0.1551 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
# Force the fit through the origin
my_model0 <- lm(Sepal.Length~Sepal.Width-1, data=iris)
summary(my_model0)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width - 1, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5236 -1.0362 0.4823 0.9897 2.8406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Sepal.Width 1.86901 0.03265 57.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.235 on 149 degrees of freedom
## Multiple R-squared: 0.9565, Adjusted R-squared: 0.9562
## F-statistic: 3277 on 1 and 149 DF, p-value: < 2.2e-16
t value = estimate / standard error => must be high
Pr(>|t|) => probability of observing such a t value if the true
coefficient were 0 => must be small.
Again we can look at the p-value.
In this case my_model0 fits better than my_model (note that the R-squared of a model without intercept is not directly comparable to that of a model with one).
plot(Sepal.Length~Sepal.Width, data = iris, xlim=c(-10,5), ylim=c(0,10))
abline(my_model, col="red")
abline(my_model0, col="blue")
# Catch the stats of the model in a variable
Stat_myModel<-summary(my_model)
# Catch the intercept (b) and its standard error
interception <- Stat_myModel[["coefficients"]][1,1]
error_std <- Stat_myModel[["coefficients"]][1,2]
# Analysis of the confidence interval
t95<-qt(0.975,2) # Student's two-sided quantile at 5% risk (note: the residual df of this model is 148; confint() uses the correct df)
ICmin_int<-interception - t95 * error_std # lower bound of the 95% CI of the intercept
ICmax_int<-interception + t95 * error_std # upper bound of the 95% CI of the intercept
# Print results
print(paste("Interception =",signif(interception,3), "in the interval of [",signif(ICmin_int,3),";",signif(ICmax_int,3),"]."))
## [1] "Interception = 6.53 in the interval of [ 4.47 ; 8.59 ]."
To predict a new \(Y\) from a new value \(x\):
p1 <- predict.lm(my_model, tibble("Sepal.Width"=c(2.7)))
p0 <- predict.lm(my_model0, tibble("Sepal.Width"=c(2.7)))
print(paste("Model 1: Y_pred =", signif(p1,3),
" | Model 2 (trough 0); Y_pred =",signif(p0,3)))
## [1] "Model 1: Y_pred = 5.92 | Model 2 (trough 0); Y_pred = 5.05"
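predict.lm can also return an interval around the predicted value via the interval argument (a sketch on my_model, rebuilt here to stay self-contained):

```r
my_model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
# prediction interval for a new observation at Sepal.Width = 2.7
pi_new <- predict(my_model, data.frame(Sepal.Width = 2.7),
                  interval = "prediction")
pi_new
```

Use interval = "confidence" for the narrower interval on the mean response instead.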
To predict a new \(x\) from \(Y\), it is not possible to do it directly with
predict.lm. Instead, calculate x from the coefficients following this
equation: \(Y=ax+b \Leftrightarrow
x=\frac{Y-b}{a}\)
coef_model <- coef(my_model)
coef_model0 <- coef(my_model0)
Y_new <- 8.3
x_pred_1 <- (Y_new-coef_model[1])/coef_model[2]
x_pred_2 <- (Y_new)/coef_model0[1]
print(paste("Model 1: x_pred =", signif(x_pred_1,3),
" | Model 2 (trough 0); x_pred =",signif(x_pred_2,3)))
## [1] "Model 1: x_pred = -7.94 | Model 2 (trough 0); x_pred = 4.44"
for (i in 1:5){
print(paste("Line#", i, sep=" "))
}
## [1] "Line# 1"
## [1] "Line# 2"
## [1] "Line# 3"
## [1] "Line# 4"
## [1] "Line# 5"
The command if is used to check conditions:
- == for an equivalence
- != for a difference
- <= / >= less/greater than or equal to
- < / > strictly less/greater than
- && / || AND / OR ==> for multiple conditions
- isTRUE(x) tests if x is TRUE
- na.rm=TRUE to remove the empty values
if (condition 1){
action 1
}else if ((condition 2) || (condition 3)){
action 2
}else{
other things
}
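A minimal runnable example of the structure above (the variable names are illustrative):

```r
x <- 7
if (x %% 2 == 0) {
  msg <- "even"
} else if (x > 5) {
  msg <- "odd and greater than 5"
} else {
  msg <- "odd"
}
msg
```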